Abstract

Text documents are complex high dimensional objects. To effectively visualize such data it is important
to reduce its dimensionality and visualize the low dimensional embedding as a 2-D or 3-D scatter
plot. In this paper we explore dimensionality reduction methods that draw upon domain knowledge
in order to achieve a better low dimensional embedding and visualization of documents. We consider
the use of geometries specified manually by an expert, geometries derived automatically from corpus
statistics, and geometries computed from linguistic resources.

1 Introduction

Visual document analysis systems such as IN-SPIRE have demonstrated their applicability in managing
large text corpora, identifying topics within a document and quickly identifying a set of relevant documents
by visual exploration. The success of such systems depends on several factors, the most important
being the quality of the dimensionality reduction. Visual exploration is possible only when the
dimensionality reduction preserves the structure of the original space, i.e., documents that
convey similar topics are mapped to nearby regions in the low dimensional 2-D or 3-D space.

Standard dimensionality reduction methods such as principal component analysis (PCA), locally linear
embedding (LLE) [19], or t-distributed stochastic neighbor embedding (t-SNE) [22] take as input a set of
feature vectors such as bag of words or tf vectors. An obvious drawback of this approach is that such
methods ignore the textual nature of documents and instead consider the vocabulary words
$V = \{v_1, \ldots, v_n\}$ as abstract orthogonal dimensions that are unrelated to each other. In this paper
we introduce a general technique for incorporating domain knowledge into dimensionality reduction for
text documents. In contrast to several recent alternatives, our technique is completely unsupervised and
does not require any labeled data.

We focus on the following type of non-Euclidean geometry where the distance between documents $x$ and
$y$ is defined as

$$d_T(x, y) = \sqrt{(x - y)^\top T (x - y)}. \qquad (1)$$
Here $T \in \mathbb{R}^{n \times n}$ is a symmetric positive semidefinite matrix, and we assume that
documents $x, y$ are represented as term-frequency (tf) column vectors. Since $T$ can always be written as
$H^\top H$ for some matrix $H \in \mathbb{R}^{m \times n}$ where $m \le n$, an equivalent but sometimes
more intuitive interpretation of (1) is to compose the mapping $x \mapsto Hx$ with the Euclidean geometry

$$d_T(x, y) = d_I(Hx, Hy) = \|Hx - Hy\|_2. \qquad (2)$$
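
A small numerical check may help make the equivalence of (1) and (2) concrete: since $T = H^\top H$, we have $(x - y)^\top T (x - y) = \|H(x - y)\|_2^2$. The sketch below is only an illustration; the matrix `H` and the tf vectors `x`, `y` are hypothetical values, not taken from the experiments.

```python
import math

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def dist_T(x, y, T):
    """Equation (1): sqrt((x - y)^T T (x - y))."""
    d = [a - b for a, b in zip(x, y)]
    Td = matvec(T, d)
    return math.sqrt(sum(a * b for a, b in zip(d, Td)))

def dist_H(x, y, H):
    """Equation (2): Euclidean distance between Hx and Hy."""
    hx, hy = matvec(H, x), matvec(H, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(hx, hy)))

# Hypothetical 2 x 3 matrix H (m = 2 rows, n = 3 vocabulary words).
H = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]
T = matmul(transpose(H), H)   # T = H^T H is symmetric positive semidefinite

x = [2.0, 0.0, 1.0]           # hypothetical tf vectors
y = [0.0, 1.0, 3.0]
d1 = dist_T(x, y, T)
d2 = dist_H(x, y, H)
```

Both `d1` and `d2` evaluate the same quantity, so in practice it is usually enough to map documents through `H` once and then work with ordinary Euclidean distances.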

We can view $T$ as encoding the semantic similarity between pairs of words. When $H$ is a square matrix, it
smoothes the tf vector $x$ by mapping observed words to unobserved related words. Alternatively, if $m$, the
number of rows of $H$, equals the number of existing topics, the mapping can be viewed as describing a
document as a mixture of such topics. Therefore, the geometry realized by (1) or (2) may be used to derive
novel dimensionality reduction methods that are customized to text in general and to specific text domains
in particular. The main challenge is to obtain matrices $H$ or $T$ that describe the relationship among
vocabulary words appropriately.

We consider obtaining $H$ or $T$ using three general types of domain knowledge. The first corresponds to
manual specification of the semantic relationship among words. The second corresponds to analyzing the
relationship between different words using corpus statistics. The third corresponds to knowledge obtained
from linguistic resources. In some cases, $T$ might be easier to obtain than $H$. Whether to specify $H$
directly or indirectly through $T$ depends on the knowledge type and is discussed in detail in Section 4.

We investigate the performance of the proposed dimensionality reduction methods for three text domains:
sentiment visualization for movie reviews, topic visualization for newsgroup discussion articles, and
visual exploration of ACL papers. In each of these domains we compare several different domain dependent
geometries and show that they outperform popular state-of-the-art techniques. Generally speaking, we
observe that geometries obtained from corpus statistics are superior to manually constructed geometries and
to geometries derived from standard linguistic resources such as Word-Net. We also demonstrate effective
ways to combine different types of domain knowledge and show how such combinations significantly
outperform any of the domain knowledge types in isolation. All the techniques mentioned in this paper are
unsupervised, making use of labels only for evaluation purposes.

2 Related Work 

Despite having a long history, dimensionality reduction is still an active research area. Broadly speaking,
dimensionality reduction methods may be classified as projective or manifold based [3]. The first projects
data onto a linear subspace (e.g., PCA and canonical correlation analysis) while the second traces a low
dimensional nonlinear manifold on which data lies (e.g., multidimensional scaling, isomap, Laplacian
eigenmaps, LLE and t-SNE). The use of dimensionality reduction for text documents is surveyed by [21], who
also describe current homeland security applications.

Dimensionality reduction is closely related to metric learning. [23] is one of the earliest papers that
focus on learning metrics of the form (1). In particular, they try to learn the matrix $T$ in a supervised
way by expressing relationships between pairs of samples. A representative paper on unsupervised metric
learning for text documents is [14], which learns a metric on the simplex based on the geometric volume of
the data.

We focus in this paper on visualizing a corpus of text documents using a 2-D scatter plot. While this is
perhaps the most popular and practical text visualization technique, other methods such as [20], [10], [9],
[16], [1], [15] exist. It is conceivable that the techniques developed in this paper may be ported to enhance
these alternative visualization methods as well.
Figure 1: An example of a decomposition $H = RD$ in the case of two word clusters $\{v_1, v_2, v_3\}$,
$\{v_4, v_5\}$. The block diagonal elements in $R$ represent the fact that words are mostly mapped to
themselves, but sometimes are mapped to other words in the same cluster. The diagonal matrix represents the
fact that the first cluster is somewhat more important than the second cluster for the purposes of
dimensionality reduction.

3 Non-Euclidean Geometries 

Dimensionality reduction methods often assume, either explicitly or implicitly, Euclidean geometry. For
example, PCA minimizes the reconstruction error for a family of Euclidean projections. LLE uses the
Euclidean geometry as a local metric. t-SNE is based on a neighborhood structure, determined again by
the Euclidean geometry. The generic nature of the Euclidean geometry makes it somewhat unsuitable for
visualizing text documents, as the relationship between words conflicts with Euclidean orthogonality. We
consider in this paper several alternative geometries of the form (1) or (2), which are better suited for
text, and compare their effectiveness in visualizing documents.

As mentioned in Section 1, $H$ smoothes the tf vector $x$ by mapping the observed words into observed
and non-observed (but related) words. Decomposing $H = R \times D$ into a product of a Markov morphism¹
$R \in \mathbb{R}^{n \times n}$ and a non-negative diagonal matrix $D \in \mathbb{R}^{n \times n}$, we see
that the matrix $H$ plays two roles: blending related vocabulary words (realized by $R$) and emphasizing
some words over others (realized by $D$). The $j$-th column of $R$ stochastically smoothes word $w_j$ into
related words $w_i$, where the amount of smoothing is determined by $R_{ij}$. Intuitively, $R_{ij}$ is high
if $w_i, w_j$ are similar and 0 if they are unrelated. The role of the matrix $D$ is to emphasize some words
over others. For example, $D_{ii}$ values corresponding to content words may be higher than values
corresponding to stop words or less important words.
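
To illustrate the two roles of $H = R \times D$, the following minimal sketch builds a column stochastic `R` and a diagonal `D` for a hypothetical three-word vocabulary in which words 0 and 1 are related and word 2 is a down-weighted, unrelated word. All numeric values are assumptions chosen for illustration.

```python
def column_normalize(M):
    """Rescale each column to sum to 1, giving a column stochastic matrix."""
    n_rows, n_cols = len(M), len(M[0])
    col_sums = [sum(M[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[M[i][j] / col_sums[j] for j in range(n_cols)] for i in range(n_rows)]

def compose_H(R, d):
    """H = R D with D = diag(d): column j of R is scaled by d[j]."""
    return [[R[i][j] * d[j] for j in range(len(d))] for i in range(len(R))]

def smooth(H, x):
    """Map a tf vector x to Hx."""
    return [sum(h * v for h, v in zip(row, x)) for row in H]

# Hypothetical similarity weights: words 0 and 1 blend into each other.
raw = [[4.0, 1.0, 0.0],
       [1.0, 4.0, 0.0],
       [0.0, 0.0, 1.0]]
R = column_normalize(raw)
d = [1.0, 1.0, 0.5]          # hypothetical importances; word 2 is de-emphasized
H = compose_H(R, d)

x = [5.0, 0.0, 2.0]          # word 1 is unobserved in this document
hx = smooth(H, x)            # word 1 receives mass from the related word 0
```

In this example the unobserved word 1 obtains a non-zero weight through its relation to word 0, while the contribution of word 2 is halved by `D`.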

It is instructive to examine the matrices $R$ and $D$ in the case where the vocabulary words cluster
in some meaningful way. Figure 1 gives an example where vocabulary words form two clusters.
The matrix $R$ may become block-diagonal with non-zero elements occupying diagonal blocks representing
within-cluster word blending, i.e., words within each cluster are interchangeable to some degree. The
diagonal matrix $D$ represents the importance of different clusters. The word clusters are formed with
respect to the visualization task at hand. For example, in the case of visualizing the sentiment content of
reviews we may have word clusters labeled as "positive sentiment words", "negative sentiment words" and
"objective words". In general, the matrices $R, D$ may be defined based on the language or may be specific
to document domain and visualization purpose. It is reasonable to expect that the words emphasized for
visualizing topics in news stories might be different than the words emphasized for visualizing writing
styles or sentiment content.

The above discussion remains valid when $H \in \mathbb{R}^{m \times n}$ for $m$ being the number of topics
in the set of documents. In fact, the $j$-th column of $R$ now stochastically maps word $j$ to related
topics $i$.

¹ A non-negative matrix whose columns sum to 1 [4].
Applying the geometry (1) or (2) to dimensionality reduction is easily accomplished by first mapping
documents $x \mapsto Hx$ and then proceeding with standard dimensionality reduction techniques such as PCA
or t-SNE. The resulting dimensionality reduction is Euclidean in the transformed space but non-Euclidean in
the original space.
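
The two-step procedure (map $x \mapsto Hx$, then reduce) can be sketched as follows. This is a minimal illustration under stated assumptions: the vocabulary, the matrix `H` and the documents are hypothetical, and PCA is implemented here by simple power iteration with deflation rather than by any particular library routine.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def map_documents(H, docs):
    """Apply x -> Hx to every tf vector."""
    return [[dot(row, x) for row in H] for x in docs]

def center(rows):
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [[r[j] - mean[j] for j in range(dim)] for r in rows]

def leading_direction(rows, iters=200):
    """Top principal direction of centered rows via power iteration on X^T X."""
    dim = len(rows[0])
    v = [float(j + 1) for j in range(dim)]   # arbitrary non-symmetric start
    for _ in range(iters):
        proj = [dot(r, v) for r in rows]
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(dim)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v

def pca_2d(rows):
    """Project rows onto their first two principal directions."""
    X = center(rows)
    v1 = leading_direction(X)
    # Deflate: remove the component along v1, then find the next direction.
    X_rest = [[r[j] - dot(r, v1) * v1[j] for j in range(len(v1))] for r in X]
    v2 = leading_direction(X_rest)
    return [[dot(r, v1), dot(r, v2)] for r in X]

# Hypothetical vocabulary: ["good", "great", "bad", "awful"].
H = [[0.6, 0.4, 0.0, 0.0],
     [0.4, 0.6, 0.0, 0.0],
     [0.0, 0.0, 0.6, 0.4],
     [0.0, 0.0, 0.4, 0.6]]
docs = [[3.0, 0.0, 0.0, 0.0],   # uses only "good"
        [0.0, 3.0, 0.0, 0.0],   # uses only "great"
        [0.0, 0.0, 3.0, 0.0],   # uses only "bad"
        [0.0, 0.0, 0.0, 3.0]]   # uses only "awful"
emb = pca_2d(map_documents(H, docs))
```

With $H = I$ the four documents would be mutually equidistant; after the mapping, the first principal coordinate separates the two word clusters, which is the kind of structure the geometry is meant to expose.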

In many cases, the vocabulary contains tens of thousands of words or more, making the specification of
the matrices $R, D$ a complicated and error prone task. We describe in the next section several techniques
for specifying $R, D$ in practice. Note that even if in some cases $R, D$ are obtained indirectly by
decomposing $T$ into $H^\top H$, the discussion of the role of $R, D$ is still of importance, as the
matrices can be used to come up with word clusters whose quality may be evaluated manually based on the
visualization task at hand.

4 Domain Knowledge 

We consider four different techniques for obtaining the transformation matrix $H$. Each technique takes
one of two approaches: (1) separately obtain the column stochastic matrix $R$, which blends different words,
and the diagonal matrix $D$, which determines the importance of each word; (2) estimate the semantic
similarity matrix $T$ and decompose it as $H^\top H$. To ensure that $H$ is non-negative, and hence
interpretable, non-negative matrix factorization techniques such as the one in [1] may be applied.

Method A: Manual Specification 

In this method, an expert user manually specifies the matrices $(R, D)$ based on an assessment of the
relationship among the vocabulary words. More specifically, the user first constructs a hierarchical word
clustering that may depend on the current text domain, and then specifies the matrices $(R, D)$ with respect
to the cluster membership of the vocabulary.

Denoting the clusters by $C_1, C_2, \ldots$ (a partition of $\{v_1, \ldots, v_n\}$), the user specifies $R$
by setting the values

$$R_{ij} \propto \begin{cases} \rho_a, & i = j,\ v_i \in C_a \\ \rho_{ab}, & i \neq j,\ v_i \in C_a,\ v_j \in C_b \end{cases} \qquad (3)$$

appropriately. The values $\rho_a$ and $\rho_{aa}$ together determine the blending of words from the same
cluster. The value $\rho_{ab}$, $a \neq b$, captures the semantic similarity between two clusters. That value
may be either computed manually for each pair of clusters or automatically from the clustering hierarchy (for
example, $\rho_{ab}$ can be the minimal number of tree edges traversed to move from $a$ to $b$). The matrix
$R$ is then normalized appropriately to form a column stochastic matrix. The matrix $D$ is specified by
setting the values

$$D_{ii} = d_a, \quad v_i \in C_a \qquad (4)$$

where $d_a$ may indicate the importance of word cluster $C_a$ to the current visualization task. We
emphasize that as with the rest of the methods in this paper, the manual specification is done without
access to labeled data.
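
The construction in (3) and (4) can be sketched in a few lines. The cluster assignment mirrors the two clusters of Figure 1, but the weights `rho_self`, `rho_pair` and `d` are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
def build_R(cluster_of, rho_self, rho_pair):
    """Equation (3) followed by column normalization.

    cluster_of[i]    : cluster index of word i
    rho_self[a]      : weight rho_a used on the diagonal (i == j)
    rho_pair[(a, b)] : weight rho_ab used off the diagonal (a == b gives rho_aa)
    """
    n = len(cluster_of)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a, b = cluster_of[i], cluster_of[j]
            R[i][j] = rho_self[a] if i == j else rho_pair[(a, b)]
    for j in range(n):                      # make every column sum to 1
        s = sum(R[i][j] for i in range(n))
        for i in range(n):
            R[i][j] /= s
    return R

def build_D(cluster_of, d):
    """Equation (4): diagonal entries D_ii = d_a for word i in cluster a."""
    return [d[a] for a in cluster_of]

# Two clusters {v1, v2, v3} and {v4, v5}, as in Figure 1.
cluster_of = [0, 0, 0, 1, 1]
rho_self = {0: 0.8, 1: 0.9}                 # hypothetical weights
rho_pair = {(0, 0): 0.1, (1, 1): 0.1, (0, 1): 0.0, (1, 0): 0.0}
d = {0: 2.0, 1: 1.0}                        # first cluster is more important

R = build_R(cluster_of, rho_self, rho_pair)
D_diag = build_D(cluster_of, d)
```

Setting the between-cluster weights to zero yields the block-diagonal structure described for Figure 1; non-zero values of `rho_pair[(0, 1)]` would let the clusters blend.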

Since manual clustering assumes some form of human intervention, it is reasonable to also consider
cases where the user specifies $(R, D)$ in an interactive manner. That is, the expert specifies an initial
clustering of words and $(R, D)$, views the resulting visualization and adjusts the selection interactively
until satisfied.
Method B: Contextual Diffusion



An alternative technique which performs substantially better is to consider a transformation based on the
similarity between the contextual distributions of the vocabulary words. The contextual distribution of word
$v$ is defined as

$$q_v(w) = p(w \text{ appears in } x \mid v \text{ appears in } x) \qquad (5)$$

where $x$ is a randomly drawn document. In other words, $q_v$ is the distribution governing the words
appearing in the context of word $v$.

A natural similarity measure between distributions is the Fisher diffusion kernel proposed by [11].
Applied to contextual distributions as in [6] we arrive at the following similarity matrix (where $c > 0$)

$$T(u, v) = \exp\left(-c \arccos^2\left(\sum_w \sqrt{q_u(w)\, q_v(w)}\right)\right).$$

Intuitively, the word $u$ will be translated or diffused into $v$ depending on the geometric diffusion between
the distributions of likely contexts.
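
As a rough illustration of the similarity above, the sketch below evaluates the kernel for two discrete distributions, assuming the squared-arccos form of the Fisher diffusion kernel. The function name and the example distributions are hypothetical.

```python
import math

def diffusion_similarity(q_u, q_v, c=1.0):
    """exp(-c * arccos^2(sum_w sqrt(q_u(w) * q_v(w)))) for two distributions
    given as equal-length lists of probabilities."""
    affinity = sum(math.sqrt(a * b) for a, b in zip(q_u, q_v))
    # Guard against rounding slightly outside the domain of arccos.
    affinity = min(1.0, max(0.0, affinity))
    return math.exp(-c * math.acos(affinity) ** 2)

same = diffusion_similarity([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])
close = diffusion_similarity([0.5, 0.5, 0.0], [0.4, 0.5, 0.1])
disjoint = diffusion_similarity([1.0, 0.0], [0.0, 1.0])
```

Identical contextual distributions give similarity 1, and distributions with disjoint support give the smallest value, $\exp(-c\,\pi^2/4)$. The constant `c` controls how quickly similarity decays.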

We use the following formula to estimate the contextual distribution from a corpus of documents

$$q_w(u) = \sum_{x'} p(u, x' \mid w) = \sum_{x'} p(u \mid x', w)\, p(x' \mid w) = \sum_{x'} \mathrm{tf}(u, x')\, \frac{\mathrm{tf}(w, x')}{\sum_{x''} \mathrm{tf}(w, x'')} \qquad (6)$$

where $\mathrm{tf}(w, x)$ is the number of times word $w$ appears in document $x$. The contextual
distribution $q_w$ or the diffusion matrix $T$ above may be computed in an unsupervised manner without need
for labels.
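
The estimate of the contextual distribution can be sketched directly from term counts. The code below evaluates the formula literally with raw counts and then, as an explicit assumption of this sketch, rescales the result to sum to 1 so that it can be treated as a distribution. The toy corpus and function names are hypothetical.

```python
def contextual_distribution(word, vocab, docs):
    """q_w(u) = sum_x tf(u, x) * tf(w, x) / sum_x tf(w, x), for every u in vocab.

    docs is a list of dicts mapping words to raw counts.
    """
    total = sum(doc.get(word, 0) for doc in docs)
    if total == 0:
        raise ValueError("word does not occur in the corpus: " + word)
    return [sum(doc.get(u, 0) * doc.get(word, 0) for doc in docs) / total
            for u in vocab]

def normalize(q):
    """Rescale to sum to 1 (an assumption of this sketch, see lead-in)."""
    s = sum(q)
    return [v / s for v in q]

vocab = ["good", "bad", "film", "plot"]
docs = [{"good": 2, "film": 1},
        {"good": 1, "plot": 1},
        {"bad": 1, "plot": 2}]

q_good_raw = contextual_distribution("good", vocab, docs)
q_good = normalize(q_good_raw)
```

Here "film" and "plot" receive mass in the context of "good" because they co-occur with it, while "bad" receives none.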



Method C: Web n-Grams 

The contextual distribution method above may be computed based on a large collection of text documents
such as the Reuters RCV1 dataset. The estimation accuracy of the contextual distribution increases with the
number of documents, which may not be as large as required. An alternative is to estimate the contextual
distributions $q_w$ from the entire n-gram content of the web. Taking advantage of the publicly available
Google n-gram dataset², we can leverage the massive size of the web to construct the similarity matrix $T$.
More specifically, we compute the contextual distribution by altering (6) to account for the proportion of
times two words appear together within the n-grams (we used $n = 3$ in our experiments).



Method D: Word-Net 

The last method we consider uses Word-Net, a standard linguistic resource, to specify the matrix $T$ in (1).
This is similar to manual specification (method A) in that it builds on expert knowledge rather than corpus
statistics. In contrast to method A, however, Word-Net is a carefully built resource containing more accurate
and comprehensive linguistic information such as synonyms, hyponyms and holonyms. On the other hand,
its generality puts it at a disadvantage, as method A may be used to construct a geometry suited to a
specific text domain.

² The Google n-gram dataset contains n-gram counts ($n \le 5$) obtained from Google based on processing over
a trillion words of running text.

We follow m who compare five similarity measures between words based on Word-Net. In our experi- 
ments we use Jiang and Conrath's measure llTTl (see also US) 

$$T_{c_1, c_2} = \log \frac{p(c_1)\, p(c_2)}{2\, p(\mathrm{lcs}(c_1, c_2))}$$

as it was shown to outperform the others. Above, lcs stands for the lowest common subsumer, that is, the
lowest node in the hierarchy that subsumes (is a hypernym of) both $c_1$ and $c_2$. The quantity $p(c)$ is
the probability that a randomly selected word in a corpus is an instance of the synonym set that contains
word $c$.



Convex Combinations 

In addition to methods A-D, which constitute "pure methods", we also consider convex combinations

$$H^* = \sum_i \alpha_i H_i, \qquad \alpha_i \ge 0, \quad \sum_i \alpha_i = 1 \qquad (7)$$

where $H_i$ are matrices from methods A-D, and $\alpha$ is a non-negative weight vector which sums to 1.
Equation (7) allows combining heterogeneous types of domain knowledge (manually specified, such as methods
A and D, and automatically derived, such as methods B and C). Doing so leverages their diverse nature and
potentially achieves higher performance than each of the methods A-D on its own.
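
A convex combination of transformation matrices is straightforward to compute. The helper below is a minimal sketch with a hypothetical name; it checks the constraints on the weights before mixing the matrices entry by entry.

```python
def convex_combination(matrices, alphas, tol=1e-9):
    """Return sum_i alphas[i] * matrices[i] for equally sized matrices.

    Raises ValueError unless the weights are non-negative and sum to 1.
    """
    if len(matrices) != len(alphas):
        raise ValueError("need one weight per matrix")
    if any(a < 0 for a in alphas) or abs(sum(alphas) - 1.0) > tol:
        raise ValueError("weights must be non-negative and sum to 1")
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(a * M[i][j] for a, M in zip(alphas, matrices))
             for j in range(n_cols)]
            for i in range(n_rows)]

# Two hypothetical 2 x 2 transformation matrices.
H_1 = [[1.0, 0.0], [0.0, 1.0]]
H_2 = [[1.0, 1.0], [1.0, 1.0]]
H_star = convex_combination([H_1, H_2], [0.25, 0.75])
```

Because the weights are non-negative and sum to 1, a combination of non-negative matrices stays non-negative, so the result remains interpretable in the same way as its components.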



5 Experiments 

We evaluated methods A-D and the convex combination method by experimenting on two datasets from
different domains. The first is the Cornell sentiment scale dataset of movie reviews [17]. The visualization
in this case focuses on the sentiment quantity [18]. For simplicity, we only kept documents having sentiment
level 1 (very bad) and 4 (very good). Preprocessing included lower-casing, stop word removal, stemming,
and selecting the most frequent 2000 words. Alternative preprocessing is possible but should not modify
the results much, as we focus on comparing alternatives rather than measuring absolute performance. The
second text dataset is 20 newsgroups. It consists of newsgroup articles from 20 distinct newsgroups and is
meant to demonstrate topic visualization.

To measure the dimensionality reduction quality, we display the data as a scatter plot with different data
groups (topics, sentiments) displayed with different markers and colors. Our quantitative evaluation is based
on the fact that documents belonging to different groups (topics, sentiments) should be spatially separated
in the 2-D space. Specifically, we used the following indices to evaluate different reduction methods and
geometries.

(i) The weighted intra-inter measure is a standard clustering quality index that is invariant to non-singular
linear transformations of the embedded data. It equals $\operatorname{tr}(S_T^{-1} S_W)$, where $S_W$ is the
within-cluster scatter matrix, $S_T = S_W + S_B$ is the total scatter matrix, and $S_B$ is the between-cluster
scatter matrix

m. 
Figure 2: Manually specified hierarchical word clustering for the 20 newsgroup domain. The words in the
frames are examples of words belonging to several bottom level clusters.

(ii) The Davies Bouldin index is an alternative to (i) that is similarly based on the ratio of within-cluster
scatter to between-cluster scatter [5].

(iii) Classification error rate of a k-NN classifier that applies to data groups in the 2-D embedded space.
Despite the fact that we are not interested in classification per se (otherwise we would classify in the
original high dimensional space), it is an intuitive and interpretable measure of cluster separation.

(iv) An alternative to (iii) is to project the embedded data onto a line which is the direction returned by 

applying Fisher's linear discriminant analysis |^ to the embedded data. The projected data from 
each group is fitted to a Gaussian whose separation is used as a proxy for visualization quality. In 
particular, we summarize the separation of the two Gaussians by measuring the overlap area. While 
(iii) corresponds to the performance of a k-NN classifier, method (iv) corresponds to the performance
of Fisher's LDA classifier. 
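
For a 2-D embedding, measure (i) can be computed in closed form because the total scatter matrix is only 2 x 2. The sketch below is an illustration under that assumption; the function names and the toy points are hypothetical.

```python
def mean2(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter2(points, center):
    """Entries (sxx, sxy, syy) of the 2 x 2 scatter matrix around center."""
    sxx = sxy = syy = 0.0
    for x, y in points:
        dx, dy = x - center[0], y - center[1]
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
    return sxx, sxy, syy

def intra_inter(points, labels):
    """tr(S_T^{-1} S_W) for 2-D points; lower means better separated groups."""
    groups = {}
    for p, g in zip(points, labels):
        groups.setdefault(g, []).append(p)
    wxx = wxy = wyy = 0.0
    for members in groups.values():      # within-cluster scatter S_W
        sxx, sxy, syy = scatter2(members, mean2(members))
        wxx, wxy, wyy = wxx + sxx, wxy + sxy, wyy + syy
    txx, txy, tyy = scatter2(points, mean2(points))   # total scatter S_T
    det = txx * tyy - txy * txy
    if det == 0.0:
        raise ValueError("total scatter matrix is singular")
    # S_T^{-1} = [[tyy, -txy], [-txy, txx]] / det, then take the trace.
    return (tyy * wxx - 2.0 * txy * wxy + txx * wyy) / det

# Two groups with identical means: no separation at all.
overlapping = intra_inter([(0, 0), (2, 0), (1, 1), (1, -1)], [0, 0, 1, 1])

# Two well separated groups.
square = [(0, 0), (0, 1), (1, 0), (1, 1)]
far = [(x + 10, y + 10) for x, y in square]
separated = intra_inter(square + far, [0] * 4 + [1] * 4)
```

In two dimensions the measure ranges up to 2, attained when the group means coincide, and decreases as the groups move apart.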

Note that the above measures (i)-(iv) make use of labeled information to evaluate visualization quality. The
labeled data, however, is not used during the dimensionality reduction stages, so the reduction itself
remains unsupervised.
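
Measure (iii) can be approximated with a few lines of code. The sketch below uses leave-one-out evaluation, which is an assumption of this example rather than the exact cross validation protocol used in the experiments; the function name and toy points are hypothetical.

```python
from collections import Counter

def knn_loo_accuracy(points, labels, k=5):
    """Leave-one-out accuracy of a k-NN classifier on embedded 2-D points."""
    correct = 0
    for i, (px, py) in enumerate(points):
        # Squared Euclidean distances to every other point.
        dists = sorted(
            ((px - qx) ** 2 + (py - qy) ** 2, j)
            for j, (qx, qy) in enumerate(points) if j != i
        )
        votes = Counter(labels[j] for _, j in dists[:k])
        predicted = votes.most_common(1)[0][0]
        if predicted == labels[i]:
            correct += 1
    return correct / len(points)

# Two hypothetical, well separated groups of embedded documents.
group_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
group_b = [(10, 10), (10, 11), (11, 10), (11, 11)]
accuracy = knn_loo_accuracy(group_a + group_b, ["a"] * 4 + ["b"] * 4, k=3)
```

The labels enter only at this evaluation step; the embedding itself is produced without them.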

The manual specification of domain knowledge (method A) for the 20 newsgroups domain used matrices
$R, D$ that were specified interactively based on the (manually obtained) word clustering in Figure 2. In the
case of sentiment data the manual specification consisted of partitioning words into positive, negative or
neutral sentiment based on the General Inquirer resource³. The matrix $H$ was completed by assigning large
weights ($D_{ii}$) to negative and positive words and small weights ($D_{ii}$) to neutral words.

The contextual diffusion (method B) was computed from a large external corpus (Reuters RCV1) for the
newsgroups domain. For the sentiment domain we used movie reviews authored by other critics. Google
n-gram (method C) provided a truly massive scale resource for estimating the contextual diffusion. In the
case of Word-Net (method D) we used Ted Pedersen's implementation of Jiang and Conrath's similarity
measure⁴. Note that for methods C and D, the resulting matrix $H$ is not domain specific but rather
represents general semantic relationships between words.

³ http://www.wjh.harvard.edu/~inquirer/

Measure (i), lower is better
           PCA (1)   PCA (2)   t-SNE (1)   t-SNE (2)
H = I      1.5391    1.4085    1.1649      1.1206
B          1.2570    1.3036    1.2182      1.2331
C          1.2023    1.3407    0.7844      1.0723
D          1.4475    1.3352    1.1762      1.1362

Measure (iii), k = 5, higher is better
           PCA (1)   PCA (2)   t-SNE (1)   t-SNE (2)
H = I      0.8461    0.5630    0.9056      0.7281
B          0.7381    0.6815    0.9110      0.6724
C          0.8420    0.5898    0.9323      0.7359
D          0.8532    0.5868    0.9013      0.7728

Table 1: Quantitative evaluation of dimensionality reduction for visualization for two tasks in the news
article domain. The numbers in the upper block correspond to measure (i) (lower is better), and the numbers
in the lower block correspond to measure (iii) ($k = 5$) (higher is better). We conclude that contextual
diffusion (B), Google n-gram (C), and Word-Net (D) tend to outperform the original $H = I$.

In our experiments below we focused on two dimensionality reduction methods: PCA and t-SNE. PCA
is a well known classical method while t-SNE [22] is a recently proposed technique shown to outperform
LLE, CCA, MVU, Isomap, and Laplacian eigenmaps. Indeed it is currently considered state-of-the-art for
dimensionality reduction for visualization purposes.

Figure 3 displays a qualitative and quantitative evaluation of PCA and t-SNE for the sentiment and
newsgroup domains with the standard $H = I$ geometry (left column), manual specification (middle column) and
contextual diffusion (right column). Generally, we conclude that in both the newsgroup domain and the
sentiment domain, and both qualitatively and quantitatively (using the numbers in the top two rows), methods
A and B perform better than the original geometry $H = I$, with method B outperforming method A.

Tables 3 and 1 display two evaluation measures for different types of domain knowledge (see previous
section). Table 3 corresponds to the sentiment domain, where we conducted separate experiments for four movie
critics. Table 1 corresponds to the newsgroup domain, where two tasks were considered: the first involving
three newsgroups (classes comp.sys.mac.hardware, rec.sport.hockey and talk.politics.mideast) and the
second involving four newsgroups (rec.autos, rec.motorcycles, rec.sport.baseball and rec.sport.hockey). We
conclude from these two tables that the contextual diffusion, Google n-gram, and Word-Net generally
outperform the original $H = I$ matrix. The best method varies from task to task but the contextual diffusion
and Google n-gram seem to have the strongest performance overall.

We also examined convex combinations

$$\alpha_1 H_A + \alpha_2 H_B + \alpha_3 H_C + \alpha_4 H_D \qquad (8)$$

with $\sum_i \alpha_i = 1$ and $\alpha_i \ge 0$. Table 2 displays three evaluation measures: the weighted
intra-inter measure (i), the Davies-Bouldin index (ii), and the k-NN classifier ($k = 5$) accuracy on the
embedded documents (iii). The beginning of the section provides more information on these measures. The first
four rows correspond to the "pure" methods A, B, C, D. The bottom row corresponds to a convex combination
found by minimizing the unsupervised evaluation measure (ii). Note that the convex combination found also
outperforms A, B, C, and D on measure (i) and, more impressively, on measure (iii), which is a supervised
measure that uses labeled data (the search for the optimal combination was done based on (ii), which does not
require labeled data).
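
The text does not state how the search over $\alpha$ was carried out. One simple possibility, shown here purely as an assumption, is an exhaustive search over a regular grid on the simplex with a user-supplied scoring function (for example, an unsupervised index computed on the embedding obtained from each candidate combination). The names `simplex_grid`, `best_combination` and the toy `score` are hypothetical.

```python
def simplex_grid(k, steps):
    """Yield all length-k weight vectors whose entries are multiples of
    1/steps, are non-negative and sum to 1."""
    def compositions(remaining, slots):
        if slots == 1:
            yield (remaining,)
            return
        for first in range(remaining + 1):
            for rest in compositions(remaining - first, slots - 1):
                yield (first,) + rest
    for counts in compositions(steps, k):
        yield tuple(c / steps for c in counts)

def best_combination(score, k=4, steps=10):
    """Return the grid point with the smallest score (lower is better)."""
    return min(simplex_grid(k, steps), key=score)

# Toy score standing in for an unsupervised quality index: squared distance
# to a hypothetical target weight vector.
target = (0.3, 0.4, 0.1, 0.2)
def score(alpha):
    return sum((a - t) ** 2 for a, t in zip(alpha, target))

grid = list(simplex_grid(4, 10))
best = best_combination(score)
```

With four methods and a step of 0.1 the grid has 286 candidates, so even an expensive score (one embedding per candidate) remains manageable; finer grids or more methods would call for a smarter search.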

⁴ http://wn-similarity.sourceforge.net/
Figure 3: Qualitative evaluation of dimensionality reduction for the sentiment domain (top two rows)
and the newsgroup domain (bottom two rows). The first and the third rows display PCA reduction while
the second and the fourth display t-SNE. The left column corresponds to no domain knowledge ($H = I$),
reverting PCA and t-SNE to their original form. The middle column corresponds to manual specification
(method A). The right column corresponds to contextual diffusion (method B). Different groups (sentiment
labels or newsgroup labels) are marked with different colors and marks.

In the sentiment case (top two rows) the graphs were rotated such that the direction returned by applying 
Fisher linear discriminant onto the projected 2D coordinates aligns with the positive x-axis. The bell curves 
are Gaussian distributions fitted from the x-coordinates of the projected data points (after rotation). The 
numbers displayed in each sub-figure are computed from measure (iv). 
(α1, α2, α3, α4)       (i)       (ii)       (iii) (k=5)
(1, 0, 0, 0)           0.5756    -3.9334    0.7666
(0, 1, 0, 0)           0.5645    -4.6966    0.7765
(0, 0, 1, 0)           0.5155    -5.0154    0.8146
(0, 0, 0, 1)           0.6035    -3.1154    0.8245
(0.3, 0.4, 0.1, 0.2)   0.4735    -5.1154    0.8976

Table 2: Three evaluation measures (i), (ii), and (iii) (see the beginning of the section for description) for
convex combinations (8) using different values of $\alpha$. The first four rows represent methods A, B, C, and
D. The bottom row represents a convex combination whose coefficients were obtained by searching for
the minimizer of measure (ii). Interestingly, the minimizer also performs well on measure (i) and, more
impressively, on the labeled measure (iii).





Measure (i), lower is better
         Dennis Schwartz    James Berardinelli   Scott Renshaw      Steve Rhodes
         PCA      t-SNE     PCA      t-SNE       PCA      t-SNE     PCA      t-SNE
H = I    1.8625   1.8781    1.4704   1.5909      1.8047   1.9453    1.8013   1.8415
A        1.8474   1.7909    1.3292   1.4406      1.6520   1.8166    1.4844   1.6610
B        1.4254   1.5809    1.3140   1.3276      1.5133   1.6097    1.5053   1.6145
C        1.6868   1.7766    1.3813   1.4371      1.7200   1.8605    1.7750   1.7979

Measure (iii), k = 5, higher is better
         Dennis Schwartz    James Berardinelli   Scott Renshaw      Steve Rhodes
         PCA      t-SNE     PCA      t-SNE       PCA      t-SNE     PCA      t-SNE
H = I    0.6404   0.7465    0.8481   0.8496      0.6559   0.6821    0.6680   0.7410
A        0.6011   0.7779    0.9224   0.8966      0.7424   0.7411    0.8350   0.8513
B        0.8831   0.8554    0.9188   0.9377      0.8215   0.8332    0.8124   0.8324
C        0.7238   0.7981    0.8871   0.9093      0.6897   0.7151    0.6724   0.7726

Table 3: Quantitative evaluation of dimensionality reduction for visualization in the sentiment domain.
Each of the four column groups corresponds to a different movie critic from the Cornell dataset (see text).
The upper block corresponds to measure (i) (lower is better) and the lower block corresponds to measure (iii)
($k = 5$, higher is better). Results were averaged over 40 cross validation iterations. We conclude that all
methods outperform the original $H = I$, with the contextual diffusion and manual specification generally
outperforming the others.
Figure 4: Qualitative evaluation of dimensionality reduction for the ACL dataset using t-SNE. Left: no
domain knowledge ($H = I$); Middle: manual specification (method A); Right: contextual diffusion (method
B). Each document is labeled by its assigned id from the ACL anthology.



We conclude that combining heterogeneous domain knowledge may improve the quality of dimensionality 
reduction for visualization, and that the search for an improved convex combination may be accomplished 
without the use of labeled data. 

Finally, we demonstrate the effect of linguistic geometries on a new dataset that consists of all oral
papers appearing in ACL 2001 - 2009. For the purpose of manual specification, we obtain 1545 unique words
from paper titles, and assign each word relatedness scores for each of the following clusters:
morphology/phonology, syntax/parsing, semantics, discourse/dialogue, generation/summarization, machine
translation, retrieval/categorization and machine learning. The score takes values from 0 to 2, where 2
represents the most relevant. The score information is then used to generate the transformation matrix $R$.
We also assign each word an importance value ranging from 0 to 3 (the larger the value, the more important
the word). This information is used to generate the diagonal matrix $D$. Figure 4 shows the projection of all
2009 papers using t-SNE (papers from 2001 to 2008 are used to estimate contextual diffusion). The manual
specification improves over no domain knowledge by separating documents into two clusters. By examining the
document ids, we find that all papers appearing in the smaller cluster correspond to either machine
translation or multilingual tasks. Interestingly, the contextual diffusion results in a one-dimensional
manifold.



6 Discussion 

In this paper we introduce several ways of incorporating domain knowledge into dimensionality reduction
for visualization of text documents. The novel methods of manual specification, contextual diffusion,
Google n-grams, and Word-Net all generally outperform the original assumption $H = I$. We emphasize
that the baseline $H = I$ is the one currently in use in most text visualization systems. The two reduction
methods of PCA and t-SNE represent a popular classical technique and a recently proposed technique that
outperforms other recent competitors (LLE, Isomap, MVU, CCA, Laplacian eigenmaps).

Our experiments demonstrate that different domain knowledge methods perform best in different situations.
As a generalization, however, the contextual diffusion and Google n-gram methods had the strongest
performance. We also demonstrate how combining different types of domain knowledge provides increased
effectiveness and that such combinations may be found without the use of labeled data.